Skip to content

[staging] PR-5351 cross-OS validation (doc extraction + 3 fixes)#89

Open
danielhanchen wants to merge 13 commits into
mainfrom
pr-5351-cross-os-validation
Open

[staging] PR-5351 cross-OS validation (doc extraction + 3 fixes)#89
danielhanchen wants to merge 13 commits into
mainfrom
pr-5351-cross-os-validation

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Staging-only PR to validate unslothai#5351 (Studio document extraction with runtime VLM probe and bounded concurrency) on Ubuntu, macOS-14 and Windows runners.

Contents:

  • Cherry-picks studio: chat document extraction with runtime VLM probe and bounded concurrency unsloth#5351 head (a79102a).
  • Adds three follow-up fixes in 73b0030:
    • GGUF singleton split: consolidated routes.inference on the canonical core.inference.llama_cpp accessor (P1).
    • Extractor semaphore leak: moved the multiprocessing setup inside the try/finally (High).
    • HTML cleanup pulling in inference: switched core.inference to PEP-562 lazy getattr (Medium).
  • Adds three regression tests under tests/studio/.
  • Trims .github/workflows/ to keep below the 5-concurrent-Windows cap.
  • Adds pr5351-ubuntu.yml, pr5351-macos.yml, pr5351-windows.yml lanes.

This branch is not intended to land upstream; the validated changes will be proposed against unslothai#5351 directly.

Etherll and others added 7 commits May 9, 2026 10:53
Expose and propagate the server-side document extraction concurrency limit so clients can cap their parallel extractions and avoid 503 busy responses. Backend: export _EXTRACT_CONCURRENCY, add max_extract_concurrency to DocumentSupportResponse, surface a default on import failure, and compute a larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads. Also create the extraction task waiter once to properly drain exceptions. Tests updated to assert the new values and chat body behavior. Frontend: add setExtractionBackendLimit and apply the backend limit in the extraction queue and UI (cap slider, adjust stored settings), call setExtractionBackendLimit when caching document support, and tweak NDJSON progress wording from “uploaded” to “processed.”
Stub document-parser availability in chat document route tests to avoid spurious 501 errors in CI when optional parsers are missing, by monkeypatching _document_parser_support and _document_parser_unavailable_reasons. Update several AST-based tests to consider both the public handler (openai_chat_completions) and the implementation function (_openai_chat_completions_impl), preferring the impl when present, so structural checks still pass after streaming bodies were moved into the implementation. Changes touch test_chat_document_routes.py and multiple tests in test_stream_cancel_registration_timing.py.
…rence coupling

Three bugs in the document-extraction refactor (PR unslothai#5351) that produced
real user-visible regressions on the unmodified PR head:

1. GGUF singleton split (P1). routes.inference defined an eager
   _llama_cpp_backend = LlamaCppBackend() module-level instance whose
   get_llama_cpp_backend() wrapper shadowed the canonical accessor in
   core.inference.llama_cpp. /api/inference/load wrote to the route
   instance; routes.models.list_models / cache-delete, run.py shutdown
   and core.chat.vlm_capability._probe_gguf all read the core instance.
   So a GGUF loaded the normal way was invisible to /api/models/list,
   deletable from cache while serving, leaked at shutdown, and the VLM
   probe never saw it. Removed the eager instance and wrapper; the
   existing top-of-file import from core.inference.llama_cpp now
   re-exports the canonical accessor.

2. Extractor semaphore leak (High). _run_extract_process_sync acquired
   _EXTRACT_SEMAPHORE before its try/finally block, so if
   multiprocessing.get_context, ctx.Queue or ctx.Process raised
   (OSError on fork-resource exhaustion, EAGAIN on Windows under
   load, Queue allocation failure on hardened sandboxes), the permit
   was leaked and the queue eventually deadlocked. Moved the try
   upwards to cover the multiprocessing setup; the finally now also
   terminates a partially-started worker.

3. HTML cleanup pulled in the inference stack (Medium). _extract_html
   did from core.inference._html_to_md import html_to_markdown.
   core.inference/__init__.py eagerly imported .orchestrator and
   .llama_cpp, so any failure deep in the inference dep chain made
   _extract_html fall back to raw HTML, splicing <script>/<style>
   tags into the outgoing prompt. Switched core.inference to PEP 562
   __getattr__ lazy resolution so the stdlib-only HTML helper imports
   without dragging in the orchestrator or llama-server backend.

Regression coverage:
- tests/studio/test_gguf_singleton_shared.py asserts the
  routes.inference / core.inference.llama_cpp accessors return the
  same instance and that the VLM probe sees a route-loaded GGUF.
- tests/studio/test_extractor_semaphore_leak.py monkeypatches each
  multiprocessing failure point and asserts _EXTRACT_SEMAPHORE._value
  is restored.
- tests/studio/test_html_independent_of_inference.py runs in a fresh
  subprocess with poisoned core.inference.orchestrator/llama_cpp
  entries and asserts <script>/<style> are stripped.

Full suite (PR head + fixes): 987 passed / 14 pre-existing failures
unchanged (GPU/transformers/llama-server-binary dependent tests).
await disconnect_task

return StreamingResponse(
_ndjson_stream(),
Comment on lines +28 to +60
runs-on: macos-14
timeout-minutes: 25
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend test dependencies (CPU only)
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install \
python-multipart aiofiles sqlalchemy cryptography \
pyyaml jinja2 mammoth unpdf requests \
'numpy<3' pytest pytest-asyncio httpx
pip install --index-url https://download.pytorch.org/whl/cpu 'torch>=2.4,<2.11'
pip install 'transformers>=4.51,<5.5'

- name: PR-5351 document tests (macOS spawn semantics)
working-directory: studio/backend
env:
# macOS's default start method is spawn; exercise the same
# config users see in production.
UNSLOTH_STUDIO_EXTRACT_CONCURRENCY: '2'
run: |
python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short

- name: PR-5351 regression tests + cancel timing
run: |
python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short
Comment on lines +29 to +57
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend test dependencies (CPU only)
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install \
python-multipart aiofiles sqlalchemy cryptography \
pyyaml jinja2 mammoth unpdf requests \
'numpy<3' pytest pytest-asyncio httpx
pip install --index-url https://download.pytorch.org/whl/cpu 'torch>=2.4,<2.11'
pip install 'transformers>=4.51,<5.5'

- name: PR-5351 document tests
working-directory: studio/backend
run: |
python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short

- name: PR-5351 regression tests + cancel timing
run: |
python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short
Comment on lines +29 to +59
runs-on: windows-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend test dependencies (CPU only)
shell: pwsh
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install python-multipart aiofiles sqlalchemy cryptography pyyaml jinja2 mammoth unpdf requests "numpy<3" pytest pytest-asyncio httpx
pip install --index-url https://download.pytorch.org/whl/cpu "torch>=2.4,<2.11"
pip install "transformers>=4.51,<5.5"

- name: PR-5351 document tests (Windows spawn semantics)
working-directory: studio/backend
shell: pwsh
env:
UNSLOTH_STUDIO_EXTRACT_CONCURRENCY: '2'
run: |
python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short

- name: PR-5351 regression tests + cancel timing
shell: pwsh
run: |
python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short
Comment thread studio/backend/core/inference/__init__.py Fixed
Comment thread tests/studio/test_extractor_semaphore_leak.py Fixed
Comment thread tests/studio/test_gguf_singleton_shared.py Fixed
Comment thread studio/backend/core/chat/document_extractor.py Fixed
Comment thread studio/backend/core/chat/document_extractor.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/frontend/src/features/chat/utils/ocr-model-orchestrator.ts Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/chat-settings-sheet.tsx Fixed
Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts Fixed
Real-world testing (Orimi test PDF, RFC 8259 PDF, "Attention Is All You
Need", calibre demo DOCX) plus an additional review pass surfaced five
follow-ups on top of the earlier singleton/semaphore/HTML fix:

1. Null-password PDFs were rejected as encrypted.
   The classic Orimi PDF, Acrobat distilled scans, and a long tail of
   PDFs in the wild carry a /Encrypt dict with an empty user password
   so the file opens without prompting. pypdf.PdfReader.is_encrypted
   and PyMuPDF's doc.is_encrypted both flag them, but the canonical
   "needs a password" signal is PyMuPDF's needs_pass. The preflight
   in routes.inference._preflight_pdf_page_count and the extractor in
   core.chat.document_extractor._extract_pdf now refuse only when
   needs_pass is True. pypdf's branch tries decrypt("") first and
   falls through to PyMuPDF on failure.

2. Worker put-result-then-die race.
   _run_extract_process_sync could observe proc.is_alive() == False
   after the worker had already queued a successful result, exit the
   loop with message=None, and surface a RuntimeError. Both the
   in-loop is_alive() branch and the post-join branch now perform a
   final result_queue.get_nowait() before declaring failure.

3. macOS multiprocessing start method.
   The ternary picked "fork" on macOS, which is unsafe with Quartz /
   PyObjC / PyMuPDF's CoreFoundation linkage. macOS now uses "spawn"
   like Windows; Linux keeps "fork" for the CoW pickling win.

4. NDJSON streaming InvalidStateError on shield-cancel race.
   The streaming NDJSON loop accepted extract_wait completion as a
   signal to call extraction_task.result(). When asyncio.shield's
   outer future was cancelled before the inner task finished, that
   raised InvalidStateError and surfaced as a generic HTTP 500.
   The branch now waits for extraction_task.done() and re-arms a
   fresh shielded future when only the outer wrapper completes.

5. PaddleOCR-VL nondeterministic inference defaults.
   Shipped temperature=1.5, min_p=0.1 -- causes hallucinated glyphs
   and reorderings on a closed-form transcription task. Aligned with
   the sibling OCR presets (DeepSeek-OCR, GLM-OCR) at temperature=0.0,
   top_p=1.0, top_k=-1, min_p=0.0.

Regression test additions:
- tests/studio/test_pseudo_encrypted_pdf.py mints a null-password PDF
  with PyMuPDF, asserts both the preflight and _extract_pdf accept it,
  and confirms a real password-required PDF still raises
  DocumentExtractionEncrypted.

Also drops importlib.reload from test_extractor_semaphore_leak.py: the
reload swapped _drain_future_exception out from under routes.inference,
breaking an existing identity assertion. The new fixture snapshots and
restores the semaphore counter instead.

Local: studio backend suite + 4 regression files: 91/91 PR-relevant
tests pass; the remaining 9 failures are the pre-existing
gpu_selection / kv_cache_estimation / help_output tests unchanged.
Mirrors the request-body hardening this PR already added to the
sibling JSON inference endpoints (/v1/chat/completions at :1674,
/v1/anthropic/messages at :2769, /v1/anthropic/messages_count at
:2850). /api/inference/cancel still used await request.json() with no
streaming cap, so an authenticated client could force the server to
buffer arbitrarily large bodies and slip past the exact overflow
hardening this PR added elsewhere.

Switched to _read_json_body_limited(request, max_bytes=64 KiB). The
real cancel payload is a small dict of identifiers (cancel_id,
completion_id, session_id, message_id); 64 KiB is generous and matches
the cap pattern used in the other authenticated route handlers.

Stream-cancel registration timing (test_stream_cancel_registration_timing
+ test_cancel_atomicity + test_cancel_id_wiring) is unchanged.
@danielhanchen danielhanchen force-pushed the pr-5351-cross-os-validation branch from 3093b18 to 13992ca Compare May 25, 2026 13:31
Comment thread tests/studio/test_extractor_semaphore_leak.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
Comment thread studio/backend/routes/inference.py Fixed
danielhanchen and others added 2 commits May 27, 2026 03:45
Reconcile 11 conflicted files spanning backend and frontend with 248
commits of main since the PR diverged:

backend
- studio.txt: keep PR additions (pypdf, python-multipart, onnxruntime)
  alongside main additions (cryptography, httpx)
- routes/inference.py: merge Pydantic model imports, keep cache_type_kv,
  combine trust_remote_code preflight with main's offline DNS guard,
  adopt main's _openai_messages_for_gguf_chat refactor, route through
  _normalize_openai_message_images, retain main's
  _drop_empty_assistant_sentinels and the is_vision kwarg
- tests/test_anthropic_messages.py: keep PR's test_image_count_limit
  alongside main's TestAnthropicRequestedStudioTools and
  TestAnthropicMessagesToolRouting suites

frontend
- types/api.ts: unify OpenAIMessageContentPart with reasoning,
  image_generation_call, and image_url.detail; retain OpenAIChatContentPart
  alias for prior callers
- types/runtime.ts: combine loadIn4Bit (PR) with fastMode (main)
- assistant-ui/attachment.tsx: combine PR's document chip and unique
  attachmentDomId with main's filename-in-aria-label
- assistant-ui/thread.tsx: keep PR's blocking-attachment send guard,
  thread the disabled state through main's ThreadComposerDock outside
  the viewport, drop the obsolete inline ComposerAnimated block,
  consolidate Composer to main's disabled/threadId API
- chat-settings-sheet.tsx: combine PR's DocumentExtractionSection and
  preset state-machine reset wiring with main's external connections
  flow, removing duplicated preset useState now sourced from the store
- chat/api/chat-adapter.ts: keep PR's appendContentParts/mergeAdjacentTextParts
  alongside main's reasoning normalizer, image-edit reference path,
  Anthropic refusal pruning, disabled-tool guard, and image gate
- chat/hooks/use-chat-model-runtime.ts: keep both cache_type_kv and
  storedReasoningEnabled hydration
- chat/runtime-provider.tsx: keep PR's document-extraction adapter,
  drop in-browser Text/Html/Docx adapters now handled by the backend,
  preserve main's listStoredChatMessages-based history and
  OpenDocumentAttachmentAdapter
- chat/shared-composer.tsx: combine PR's document upload + retry
  pipeline with main's image-availability gate and generalized-compare
  guard
- chat/stores/chat-runtime-store.ts: retain PR's DocExtractSettings,
  OcrPhase, and document support cache invalidation alongside main's
  preset-aware settings hydration, debounced saveSettingsPatch,
  beforeunload flush, scalar setting versions, last-external-checkpoint
  persistence, and maxTokens clamping
Drops studio-frontend-ci.yml, studio-inference-smoke.yml,
studio-tauri-smoke.yml, wheel-smoke.yml, release-desktop.yml and
stale.yml from this staging branch so the matrix stays below the
5-concurrent-Windows-runner cap. Keeps studio-backend-ci.yml as the
Ubuntu sanity baseline.

Adds three lanes that re-run the PR-5351 backend tests plus the
three regression tests added in the fix commit:

- pr5351-ubuntu.yml: ubuntu-latest, Python 3.11. CUDA spoof in
  tests/conftest.py engages on CPU runners.
- pr5351-macos.yml: macos-14 (arm64). Exercises the multiprocessing
  spawn start-method and the MLX branch in core.chat.vlm_capability.
- pr5351-windows.yml: windows-latest. Validates spawn + path
  normalisation + Process-construction-under-pressure (exactly the
  EAGAIN class the semaphore-leak fix protects against).

Each workflow gates on paths: studio/backend/**, tests/studio/**,
tests/conftest.py and its own file so unrelated commits do not
re-trigger.
@danielhanchen danielhanchen force-pushed the pr-5351-cross-os-validation branch from 13992ca to 7b6d399 Compare May 27, 2026 03:46
): void {
// Callers set must_change_password via setMustChangePassword(). Routing it
// through here would let CodeQL trace the boolean to localStorage and flag
// the (deliberate) JWT writes as sensitive-info storage.
let response = await postLogout(getAuthToken());
if (response && response.status === 401 && getRefreshToken()) {
const refreshed = await refreshSession();
if (refreshed) response = await postLogout(getAuthToken());
Adds a CPU end-to-end smoke that exercises:
  - the PR's `_extract_pdf` against an in-process synthetic PDF
  - llama-cpp-python (CPU build) loading Qwen2.5-0.5B-Instruct GGUF
  - inference on the extracted markdown with a ground-truth question

Runs on ubuntu-latest, macos-14, and windows-latest with no GPU.
Disables Metal on macOS and native autodetect on Windows/Linux so
the lanes stay strictly CPU.

Path-filtered to studio/backend/core/chat/, the test itself, and
each workflow file so unrelated commits don't re-trigger.
Comment on lines +26 to +52
runs-on: macos-14
timeout-minutes: 40
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend + llama-cpp-python (CPU build)
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install \
python-multipart aiofiles sqlalchemy cryptography \
pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio \
pytest-timeout huggingface_hub requests numpy
# Disable Metal so the lane stays CPU-only; mirrors a no-GPU Mac.
CMAKE_ARGS="-DGGML_METAL=OFF -DGGML_ACCELERATE=OFF -DGGML_NATIVE=OFF" \
pip install --upgrade --quiet llama-cpp-python

- name: CPU inference on extracted document
env:
PR5351_LLAMA_THREADS: '3'
run: |
python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short
Comment on lines +28 to +53
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend + llama-cpp-python (CPU build)
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install \
python-multipart aiofiles sqlalchemy cryptography \
pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio \
pytest-timeout huggingface_hub requests numpy
# CPU wheel ships pre-built on Linux; falls back to source if needed.
CMAKE_ARGS="-DGGML_NATIVE=OFF" pip install --upgrade --quiet llama-cpp-python

- name: CPU inference on extracted document
env:
PR5351_LLAMA_THREADS: '4'
run: |
python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short
Comment on lines +25 to +49
runs-on: windows-latest
timeout-minutes: 40
steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'

- name: Install backend + llama-cpp-python (CPU build)
shell: pwsh
run: |
python -m pip install --upgrade pip
pip install -r studio/backend/requirements/studio.txt
pip install python-multipart aiofiles sqlalchemy cryptography pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio pytest-timeout huggingface_hub requests numpy
$env:CMAKE_ARGS = "-DGGML_NATIVE=OFF"
pip install --upgrade --quiet llama-cpp-python

- name: CPU inference on extracted document
shell: pwsh
env:
PR5351_LLAMA_THREADS: '4'
run: |
python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short
Two corrections after the first run:
  - Point at Qwen/Qwen2.5-0.5B-Instruct-GGUF (the canonical Qwen
    team's repo); the unsloth/* fork at that name does not exist
    and returned 401 on all three runners.
  - Pass the required `max_figures`, `use_vlm_ocr`, and
    `max_visual_payloads` kwargs to `_extract_pdf`.

Verified locally on the merge tip:
  PYTHONPATH=studio/backend python -c '...' -> extracted 97 chars
  including the expected 'RFC 8259' / 'JSON' tokens.
"InferenceBackend",
"InferenceOrchestrator",
"get_inference_backend",
"get_llama_cpp_backend",
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants